In this assigment, I’m going to be taking the ranked data that was created in Assignment #2 and perform non-thresholded pathway analysis. Once that is performed, I will summarize/visualize my results using the Enrichment Map Pipeline. I will interpret the results and analyze “dark matter” present within the data.
I selected an expression dataset with GEO ID “GSE46562”, belonging to the paper “Integrative transcriptome network analysis of iPSC-derived neurons from schizophrenia and schizoaffective disorder patients with 22q11.2 deletion”1. This dataset has 9 control samples (no deletion syndrome) and 10 patient samples (deletion syndrome), counting their individual expression across over 15000 genes. The data had already been normalized via FPKM.
I used visualizations such as boxplots and mean-variance plots to present an overview of the dataset.
I performed identifier mapping by connecting the Ensembl ID of each gene present in the dataset to its respective HUGO symbol using the package biomaRt. I removed any duplicate UGO symbols or genes that did not map to a HUGO symbol.
I filtered and normalized the data. I removed genes that had less than 7 counts per million, and performed TMM normalization on the remaining genes.
Overall, the final dataset generated had the TMM normalized expression of 13918 genes, across 19 samples.
I created a dataset for the samples, which helps easily identify, based on identifier, what test condition a sample belongs to.
A design matrix was created, categorizing samples based on the experimental condition they belonged to.
For each gene present in the final dataset from Assingment 1, differential expression was calculated using DeSeq2, fitting with the created design matrix.
The Benjami-Hochberg method was used for multiple hypothesis testing and the number of genes significanty expressed and who passed correction was calculated. 362 genes were signigificantly differentially expressed (with a p-value threshold of 0.05) and 30 genes passed the multiple hypothesis correction method.
Volcano plots and Heatmaps were used to visualize the differential expression
Using the genes whose differential expression was calculated, upregulated (those with logFC > 0) and downregulated genes (those with logFC < 0) were separated into separate gene lists.
Each of the lists were subject to thresholded analysis by g::Profiler. This is used to identify biological themes for each list. The databases searched were Reactome, WikiPathways and GO::BP. Thresholded analysis was also performed on the total analysis dataset
From the thresholded analysis, the themes returned were the following:
Install all necessary packages
if(!requireNamespace("RCy3", quietly = TRUE)){
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install("RCy3")
}
trying URL 'https://packagemanager.rstudio.com/all/__linux__/bionic/291/src/contrib/RJSONIO_1.3-1.4.tar.gz'
Content type 'binary/octet-stream' length 1464717 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
trying URL 'https://packagemanager.rstudio.com/all/__linux__/bionic/291/src/contrib/igraph_1.2.5.tar.gz'
Content type 'binary/octet-stream' length 7451891 bytes (7.1 MB)
==================================================
downloaded 7.1 MB
trying URL 'https://bioconductor.org/packages/3.11/bioc/src/contrib/RCy3_2.8.1.tar.gz'
Content type 'application/x-gzip' length 4551590 bytes (4.3 MB)
==================================================
downloaded 4.3 MB
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *binary* package ‘RJSONIO’ ...
* DONE (RJSONIO)
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *binary* package ‘igraph’ ...
* DONE (igraph)
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *source* package ‘RCy3’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** help
*** installing help indices
** building package indices
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** installing vignettes
** testing if installed package can be loaded from temporary location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** testing if installed package can be loaded from final location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** testing if installed package keeps a record of temporary installation path
* DONE (RCy3)
The downloaded source packages are in
‘/tmp/RtmpQUynaW/downloaded_packages’
if(!requireNamespace("RCurl", quietly = TRUE)){
install.packages("RCurl")
}
if(!requireNamespace("kableExtra", quietly = TRUE)){
install.packages("kableExtra")
}
if(!requireNamespace("dplyr", quietly = TRUE)){
install.packages("dplyr")
}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
trying URL 'https://bioconductor.org/packages/3.11/bioc/src/contrib/genefilter_1.70.0.tar.gz'
Content type 'application/x-gzip' length 1419403 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
trying URL 'https://bioconductor.org/packages/3.11/bioc/src/contrib/geneplotter_1.66.0.tar.gz'
Content type 'application/x-gzip' length 1430836 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
trying URL 'https://bioconductor.org/packages/3.11/bioc/src/contrib/DESeq2_1.28.1.tar.gz'
Content type 'application/x-gzip' length 2053660 bytes (2.0 MB)
==================================================
downloaded 2.0 MB
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *source* package ‘genefilter’ ...
** using staged installation
** libs
g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c half_range_mode.cpp -o half_range_mode.o
gcc -I"/usr/local/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c init.c -o init.o
gcc -I"/usr/local/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c nd.c -o nd.o
gcc -I"/usr/local/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c pAUC.c -o pAUC.o
gcc -I"/usr/local/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c rowPAUCs.c -o rowPAUCs.o
gcc -I"/usr/local/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c rowttests.c -o rowttests.o
gfortran -fno-optimize-sibling-calls -fpic -g -O2 -c ttest.f -o ttest.o
g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o genefilter.so half_range_mode.o init.o nd.o pAUC.o rowPAUCs.o rowttests.o ttest.o -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/00LOCK-genefilter/00new/genefilter/libs
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** help
*** installing help indices
** building package indices
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** installing vignettes
** testing if installed package can be loaded from temporary location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** testing if installed package keeps a record of temporary installation path
* DONE (genefilter)
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *source* package ‘geneplotter’ ...
** using staged installation
** R
** data
** inst
** byte-compile and prepare package for lazy loading
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** help
*** installing help indices
** building package indices
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** installing vignettes
** testing if installed package can be loaded from temporary location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** testing if installed package can be loaded from final location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** testing if installed package keeps a record of temporary installation path
* DONE (geneplotter)
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *source* package ‘DESeq2’ ...
** using staged installation
** libs
g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I'/usr/local/lib/R/site-library/Rcpp/include' -I'/usr/local/lib/R/site-library/RcppArmadillo/include' -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c DESeq2.cpp -o DESeq2.o
g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I'/usr/local/lib/R/site-library/Rcpp/include' -I'/usr/local/lib/R/site-library/RcppArmadillo/include' -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o DESeq2.so DESeq2.o RcppExports.o -lopenblas -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/00LOCK-DESeq2/00new/DESeq2/libs
** R
** inst
** byte-compile and prepare package for lazy loading
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** help
*** installing help indices
** building package indices
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** installing vignettes
** testing if installed package can be loaded from temporary location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
** testing if installed package keeps a record of temporary installation path
* DONE (DESeq2)
The downloaded source packages are in
‘/tmp/RtmpQUynaW/downloaded_packages’
if(!requireNamespace("GSA", quietly = TRUE)){
install.packages("GSA")
}
trying URL 'https://packagemanager.rstudio.com/all/__linux__/bionic/291/src/contrib/GSA_1.03.1.tar.gz'
Content type 'binary/octet-stream' length 98777 bytes (96 KB)
==================================================
downloaded 96 KB
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *binary* package ‘GSA’ ...
* DONE (GSA)
The downloaded source packages are in
‘/tmp/RtmpQUynaW/downloaded_packages’
if(!requireNamespace("VennDiagram", quietly = TRUE)){
install.packages("VennDiagram")
}
trying URL 'https://packagemanager.rstudio.com/all/__linux__/bionic/291/src/contrib/VennDiagram_1.6.20.tar.gz'
Content type 'binary/octet-stream' length 255828 bytes (249 KB)
==================================================
downloaded 249 KB
Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Bioconductor version '3.11' is out-of-date; the current release version '3.12'
is available with R version '4.0'; see https://bioconductor.org/install
* installing *binary* package ‘VennDiagram’ ...
* DONE (VennDiagram)
The downloaded source packages are in
‘/tmp/RtmpQUynaW/downloaded_packages’
if(!requireNamespace("ComplexHeatmap", quietly = TRUE)){
install.packages("ComplexHeatmap")
}
if(!requireNamespace("circlize", quietly = TRUE)){
install.packages("circlize")
}
library(RCy3)
library(RCurl)
library(kableExtra)
library(dplyr)
library(DESeq2)
library(GSA)
library(VennDiagram)
library(ComplexHeatmap)
library(circlize)
I will specifically be using the method of GSEA preranked analysis34.For this process, two types of files are required: a rnak file (rnk) and a geneset file (gmt)
The geneset files that I will be using is the most recent Human Geneset from the baderlab collection (version created on March 2021, containing GO biological process, no IEA and pathways). The format is a .gmt file.
gmtUrl = "http://download.baderlab.org/EM_Genesets/March_01_2021/Human/symbol/"
filenames = getURL(gmtUrl)
tc = textConnection(filenames)
Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘RJSONIO’
Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘RJSONIO’
Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘RJSONIO’
Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘RJSONIO’
Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
Also defined by ‘RJSONIO’
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmtFile = unlist(regmatches(contents, rx))
download.file(paste(gmtUrl, gmtFile, sep = ""), destfile = "geneset.gmt") # save the file
trying URL 'http://download.baderlab.org/EM_Genesets/March_01_2021/Human/symbol/Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt'
Content type 'unknown' length 9406567 bytes (9.0 MB)
==================================================
downloaded 9.0 MB
First, it is required that I import the full list of genes that were analyzed in the differential expression from Assignment 2. The full analysis dataset is required as GSEA requires both strong and weak signals to perform a correct analysis.
Since DeSeq2 does not automatically calculate ranks of genes given their statistics (like packages such as edgeR). A documented way to range genes is to use the signed fold change metric (-log10(Pvalue) * sign(log fold change) and has been done in many RNA-seq analyses5. This method is used to ensure that all upregulated genes are given a positive rank, and all downregulated genes are given a negative rank.I will be saving the HGNC symbols along with their corresponding metric to the file ranked.rnk.
DeSeqOutput <- readRDS(file=file.path(getwd(),"data","DeSeqOutput.rds"))
DeSeqOutput$metric = -log10(DeSeqOutput$pvalue) * sign(DeSeqOutput$log2FoldChange)
NewDeSeq <- arrange(DeSeqOutput,desc(metric)) %>% select(1,8)
write.table(x=NewDeSeq,
file="data/ranked.rnk", sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
GSEAPreranked was performed on the web server GenePattern, with a minimum geneset size of 15 and a maximum geneset size of 200. All other parameters were left to the default.
| GS.br..follow.link.to.MSigDB | GS.DETAILS | SIZE | ES | NES | NOM.p.val | FDR.q.val | FWER.p.val | RANK.AT.MAX | LEADING.EDGE |
|---|---|---|---|---|---|---|---|---|---|
| HALLMARK_TNFA_SIGNALING_VIA_NFKB%MSIGDB_C2%HALLMARK_TNFA_SIGNALING_VIA_NFKB | Details ... | 120 | 0.6408505 | 2.457289 | 0 | 0.0000000 | 0.000 | 2709 | tags=54%, list=19%, signal=67% |
| REGULATION OF BLOOD VESSEL ENDOTHELIAL CELL MIGRATION%GOBP%GO:0043535 | Details ... | 69 | 0.6659053 | 2.340292 | 0 | 0.0000000 | 0.000 | 2333 | tags=45%, list=17%, signal=54% |
| POSITIVE REGULATION OF ENDOTHELIAL CELL MIGRATION%GOBP%GO:0010595 | Details ... | 87 | 0.6288707 | 2.292033 | 0 | 0.0000000 | 0.000 | 3603 | tags=54%, list=26%, signal=72% |
| REGULATION OF LEUKOCYTE PROLIFERATION%GOBP%GO:0070663 | Details ... | 79 | 0.6465517 | 2.291950 | 0 | 0.0000000 | 0.000 | 2169 | tags=41%, list=16%, signal=48% |
| REGULATION OF ENDOTHELIAL CELL MIGRATION%GOBP%GO:0010594 | Details ... | 130 | 0.6035616 | 2.253187 | 0 | 0.0002200 | 0.001 | 3152 | tags=46%, list=23%, signal=59% |
| NEGATIVE REGULATION OF LYMPHOCYTE ACTIVATION%GOBP%GO:0051250 | Details ... | 52 | 0.6803652 | 2.249718 | 0 | 0.0001834 | 0.001 | 2022 | tags=44%, list=15%, signal=52% |
| GTP HYDROLYSIS AND JOINING OF THE 60S RIBOSOMAL SUBUNIT%REACTOME%R-HSA-72706.2 | Details ... | 109 | 0.5987324 | 2.244383 | 0 | 0.0001572 | 0.001 | 4551 | tags=69%, list=33%, signal=101% |
| L13A-MEDIATED TRANSLATIONAL SILENCING OF CERULOPLASMIN EXPRESSION%REACTOME%R-HSA-156827.3 | Details ... | 108 | 0.6052276 | 2.240200 | 0 | 0.0001375 | 0.001 | 4551 | tags=70%, list=33%, signal=104% |
| REGULATION OF ANGIOGENESIS%GOBP%GO:0045765 | Details ... | 181 | 0.5601976 | 2.227956 | 0 | 0.0001222 | 0.001 | 2333 | tags=39%, list=17%, signal=46% |
| EUKARYOTIC TRANSLATION INITIATION%REACTOME%R-HSA-72613.3 | Details ... | 116 | 0.5876665 | 2.222993 | 0 | 0.0001100 | 0.001 | 4551 | tags=66%, list=33%, signal=98% |
Name: HALLMARK_TNFA_SIGNALING_VIA_NFKB%MSIGDB_C2%HALLMARK_TNFA_SIGNALING_VIA_NFKB
P-value: 0
ES: 0.6408505
NES: 2.4572892
FDR: 0
number of genes in leading edge: 120
DownRegData <- read.table(file = './data/new_downreg.tsv', sep = '\t', header = TRUE, fill = TRUE)
DownRegData$X = NULL
kable(head(DownRegData,10),format = "html",caption = "Table 2: Preview of report of downregulated genes",row.names = FALSE) %>% kable_styling()
| NAME | GS.br..follow.link.to.MSigDB | GS.DETAILS | SIZE | ES | NES | NOM.p.val | FDR.q.val | FWER.p.val | RANK.AT.MAX | LEADING.EDGE |
|---|---|---|---|---|---|---|---|---|---|---|
| 22Q11.2 COPY NUMBER VARIATION SYNDROME%WIKIPATHWAYS_20210210%WP4657%HOMO SAPIENS | 22Q11.2 COPY NUMBER VARIATION SYNDROME%WIKIPATHWAYS_20210210%WP4657%HOMO SAPIENS | Details ... | 74 | -0.9170594 | -3.126667 | 0.0000000 | 0.0000000 | 0.000 | 59 | tags=36%, list=0%, signal=36% |
| ADRENALINE AND NORADRENALINE BIOSYNTHESIS%PANTHER PATHWAY%P00001 | ADRENALINE AND NORADRENALINE BIOSYNTHESIS%PANTHER PATHWAY%P00001 | Details ... | 22 | -0.7859014 | -2.326897 | 0.0000000 | 0.2768456 | 0.503 | 294 | tags=23%, list=2%, signal=23% |
| SIGNAL RELEASE FROM SYNAPSE%GOBP%GO:0099643 | SIGNAL RELEASE FROM SYNAPSE%GOBP%GO:0099643 | Details ... | 57 | -0.6896297 | -2.294862 | 0.0000000 | 0.3045138 | 0.674 | 1944 | tags=47%, list=14%, signal=55% |
| NEUROTRANSMITTER SECRETION%GOBP%GO:0007269 | NEUROTRANSMITTER SECRETION%GOBP%GO:0007269 | Details ... | 57 | -0.6896297 | -2.259039 | 0.0000000 | 0.3759432 | 0.836 | 1944 | tags=47%, list=14%, signal=55% |
| SYNAPTIC VESICLE PATHWAY%WIKIPATHWAYS_20210210%WP2267%HOMO SAPIENS | SYNAPTIC VESICLE PATHWAY%WIKIPATHWAYS_20210210%WP2267%HOMO SAPIENS | Details ... | 39 | -0.7007261 | -2.206527 | 0.0000000 | 0.5882982 | 0.984 | 2157 | tags=54%, list=15%, signal=64% |
| FATTY ACYL-COA BIOSYNTHESIS%REACTOME%R-HSA-75105.6 | FATTY ACYL-COA BIOSYNTHESIS%REACTOME%R-HSA-75105.6 | Details ... | 27 | -0.7180764 | -2.177872 | 0.0082873 | 0.6834753 | 0.994 | 389 | tags=11%, list=3%, signal=11% |
| PID_REELIN_PATHWAY%MSIGDB_C2%PID_REELIN_PATHWAY | PID_REELIN_PATHWAY%MSIGDB_C2%PID_REELIN_PATHWAY | Details ... | 22 | -0.7659157 | -2.176639 | 0.0000000 | 0.5949982 | 0.995 | 170 | tags=14%, list=1%, signal=14% |
| SYNAPTIC VESICLE EXOCYTOSIS%GOBP%GO:0016079 | SYNAPTIC VESICLE EXOCYTOSIS%GOBP%GO:0016079 | Details ... | 34 | -0.7184112 | -2.174840 | 0.0027397 | 0.5303466 | 0.995 | 2195 | tags=59%, list=16%, signal=70% |
| IONOTROPIC GLUTAMATE RECEPTOR PATHWAY%PANTHER PATHWAY%P00037 | IONOTROPIC GLUTAMATE RECEPTOR PATHWAY%PANTHER PATHWAY%P00037 | Details ... | 38 | -0.6717642 | -2.174051 | 0.0054348 | 0.4759434 | 0.995 | 1565 | tags=39%, list=11%, signal=44% |
| FATTY-ACYL-COA BIOSYNTHETIC PROCESS%GOBP%GO:0046949 | FATTY-ACYL-COA BIOSYNTHETIC PROCESS%GOBP%GO:0046949 | Details ... | 25 | -0.7323793 | -2.171979 | 0.0108992 | 0.4397795 | 0.996 | 389 | tags=12%, list=3%, signal=12% |
Name: 22Q11.2 COPY NUMBER VARIATION SYNDROME%WIKIPATHWAYS_20210210%WP4657%HOMO SAPIENS
P-value: 0
ES: -0.9170594
NES: -3.1266673
FDR: 0
number of genes in leading edge: 74
When comparing to the thresholded analysis from Assignment 2, the results from the GSEAPreranked wasn’t exactly a straightforward comparsion and at points required inferences to make a correlation.
The top themes returned by the positive enrichment report (upregulated genes) in the threshholded analysis in Assignment 2 were cell migration and response to stimulus, which upon further research seemed to point towards immune response or response to various stresses. The non-threshholded analysis returned genesets pointing to regulation on endothethial cell migration which has an obvious correlation to the “cell migration” theme . The other genesets returned had names corresponding to leukocyte and lymphocyte proliferation and activation. Leukocyte and lymphocytes are types of white blood cells and therefore the regulation on their production could be in correlation to an pathogenic infection or type of immune response. Therefore, the correlation between these genesets and the terms of “response to stimulus” from the past analysis was prevalent.
The top themes returned by the negative enrichment report (downregulated genes) in the threshholded analysis were metacephalon development, cerbellar cortex formation and outflow tract morphogenesis, which are involved in brain development differentiation/function The non-threshholded analysis returned genesets on synaptic pathways, signal transfers and release and neurotransmitter secretion which are practically exclusive to brain function and development. Furthermore, both the non-thresholded and the thresholded analyses returned information on a WikiPathways term named “22Q11.2 COPY NUMBER VARIATION SYNDROME”, the syndrome of the test conditions samples in this dataset. Therefore, there was a strong correlation between the two analyses for the downregulated genes. However, there were some genesets (such as FATTY ACYL-COA BIOSYNTHESIS) that I had trouble finding a correlation between and will require further investigation.
The next step in this assignment was to take the reports of the positive and negative enrichment reports (the two tables shown above) and create an EnrichmentMap Network in Cytoscape. My dataset contained a myriad of genesets with nominal p-value 0.0 and the default parameters (p value threshold of 1.0 and FDR threshold of 0.1) returned an interaction network of almost 1500 nodes. Therefore, I decided to consult the EnrichmentMap documentation6. The documentation provided a list of possible thresholds and I decided to try the permutation for " conservative" of p < 0.005 and FDR < 0.05. I also did 50% overlap, which was recommended for most analyses. The resulting network was slightly cleaner, with 379 nodes and 1644 edges.
Next, I annotated the network using the app AutoAnnotate. I used the default parameters, selecting to create singleton clusters and spread out the annotation to avoid cluster overlap.
To make a publication ready figure, individual node labels were removed for clarity using the Enrichment Map app. I didn’t need to make any other changes, as the node clusters were already spread out for clarity.
The network was then collapsed to a theme network using AutoAnnotate. The resulting network had 134 nodes and 18 edges.
In a bizarre fashion, only one downregulated geneset (the WikiPathways geneset 22Q11.2 COPY NUMBER VARIATION SYNDROME) was present in the EnrichmentMap Network. I played around with the thresholds again and found that even at the default parameters, only one gene still showed. Therefore, it was almost impossible to infer the major themes of the downregulated genes.
Some of the most apparent themes in the upregulated genes are response interferon defense and regulation of immune cell production. Upon research on WikiPathways, I discovered this was directly related to the top geneset in the upregulated gene set: HALLMARK TNFA SIGNALING VIA NFKB. IN 20117, it was documented that NFKB (otherwise known as the nuclear factor kB), is a transcription factor that directly regulates the expression of genes involved in cell survival and immune responses. Interferons such as TNF-alpha (TNFA), the key players in activating a response to pathogenic infection, participate in a signaling pathway preceded by NFKB activation and ultimately leads to antiviral actvity and apoptosis. Not only are these two themes strongly relevant to immune response, I found that several other fo the major themes (surface wall chemokine, leukocyte chemotaxis migration, complex cyotoplasmic translation) were either pathways involved in the activation/signalling of immune cells or were pathways that took place during an immune response. Therefore, these two themes in the upregulated genesets strongly adhere to the theme of upregulated genes being involved in immune responses/cellular repsonse to stimuli and fit well with past analyses of the model.
A novel pathway present in the theme map were muscle tissue development and actin filament bundle, which I will have to do further research on. Other than that, there were no pathways inconsistent with previous analysis.
The enrichment results continue to support conclusions/mechanisms discussed in the original paper. As shown through our GSEApreranked, the top genesets returned for upregulated genes were related to immune reponse and top genesets returned for downregulated genes were related to synaptic vesicle pathways. It was mentioned in the paper that one of the top GO terms returned in their differential analysis was immune reponse (followed by programmed cell death), suggesting a molecular link between immune deficiency and additional disruption of synaptogenesis (formation of synapses between neurons) in 22q11.2 DS. It was also dicussed that the diminished dosage of the 22q11.2 genes in affected individuals may affect SZ brain development and act by disrupting mitochondrial function, particularly during activity-dependent synaptogenesis.1 These were very important findings present in the paper that was clearly reflected through the GSEA.
As shown in the header of “GSEA results” the results were pretty consistent among both the thresholded and the non-thresholded analyses. The one thing that differed was there was no mention of Primary Focal Segmental Glomerulosclerosis (FSGS), one of the top WikiPathways terms returned for Assignment 2.There was also no genesets returned referring to actual development of regions of the brain, such as cerebellar cortex development. There were only genesets that correlated to brain cell function.
One of the papers that I was able to find was by Morsheimer et. al8, outlining the fact that immunodeficiency is a central component of 22q11.2 deletion syndrome. The researchers concluded that the immunodeficiency is caused by a small thymus present in affected individuals. This affliction results in dysfunctional B cells, T cell exhaustion and high rates of atopy and autoimmunity. This results in the clear phenotype of a weakened immune system, which might support the evidence in the GSEA analysis that immune response is a major theme in differential expression.
I have chosen to highlight the genes that are significantly differentially expressed in the model but aren’t annotated to any pathways, otherwise known as Dark Matter.
The first files that I am going to retrieve are:
gmtFile <- file.path(getwd(), "geneset.gmt")
capture.output(genesets<- GSA.read.gmt(gmtFile),file="gsa_load.out")
Read 18727 records
Read 1270620 items
names(genesets$genesets) <- genesets$geneset.names
NormalizedData <- readRDS(file=file.path(getwd(),"data","normalized_counts.rds"))
expression <- NormalizedData
ranks <- NewDeSeq
enrFile1 <- read.table(file.path(getwd(),"data","new_upreg.tsv"),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
enrFile2 <- read.table(file.path(getwd(),"data","new_downreg.tsv"),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
FDRThreshold <- 0.001
#get the genes from the set of enriched pathways (no matter what threshold)
allSigEnrGenesets<- c(rownames(enrFile1)[which(enrFile1[,"FDR.q.val"]<=FDRThreshold)], rownames(enrFile2)[which(enrFile2[,"FDR.q.val"]<=FDRThreshold)])
genesSigEnrGs <- c()
for(i in 1:length(allSigEnrGenesets)){
current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% allSigEnrGenesets[i])])
genesSigEnrGs <- union(genesSigEnrGs, current_geneset)
}
genesAllGs <- unique(unlist(genesets$genesets))
There are 25730 unique genes in the geneset file.
I will now create a venn diagram of Dark Matter Overlaps
A <- genesAllGs
B <- genesSigEnrGs
C <- expression[,1]
png(file.path(getwd(),"figures","darkMatterOverlaps.png"))
draw.triple.venn(
area1=length(A),
area2=length(B),
area3 = length(C),
n12 = length(intersect(A,B)),
n13=length(intersect(A, C)),
n23 = length(intersect(B,C)),
n123 = length(intersect(A,intersect(B,C))),
category = c("all genesets", "all enrichment results", "expression"),
fill = c("red","green","blue"),
cat.col = c("red","green","blue")
)
(polygon[GRID.polygon.11], polygon[GRID.polygon.12], polygon[GRID.polygon.13], polygon[GRID.polygon.14], polygon[GRID.polygon.15], polygon[GRID.polygon.16], text[GRID.text.17], text[GRID.text.18], text[GRID.text.19], text[GRID.text.20], text[GRID.text.21], text[GRID.text.22], text[GRID.text.23], text[GRID.text.24])
Figure 5: Venn Diagram of Dark Matter Overlaps
Now, I must retrieve the sets of genes that have no annotation
genesNoAnnotation <- setdiff(expression[,1], genesAllGs)
Now, get the top ranked genes that have no annotation
rankedGeneNoAnnotation <- ranks[which(ranks[,1] %in% genesNoAnnotation),]
rownames(rankedGeneNoAnnotation) = NULL
rankedGeneNoAnnotation[1:10,]
names(rankedGeneNoAnnotation) = c("hgnc_symbol","rank")
kable(rankedGeneNoAnnotation[1:10,],format = "html",caption = "Table 1:Top ranked genes with no annotation",row.names = FALSE) %>% kable_styling()
| hgnc_symbol | rank |
|---|---|
| TSPEAR-AS1 | 2.686736 |
| BZW1 | 2.660069 |
| TSPEAR-AS2 | 2.321941 |
| PODNL1 | 2.299424 |
| SSC4D | 2.107570 |
| MAGED4B | 2.041607 |
| GTF2H2C | 1.941961 |
| CHMP1B-AS1 | 1.915361 |
| LINC00847 | 1.898574 |
| LINC02802 | 1.881050 |
Some of the top ranked genes shown in the “no annotation” table above are as follows. All information was found in GeneCards9: 1. TSPEAR-AS1/AS2 - genes associated with antisense RNA and diseases such as Deafness. I was able to find a slight correlation between the two diseases with 38% individuals with 22q11.2 DS (in a sample) were found to be afflicted with some sort of hearing loss 10.
BZW1 - protein coding gene associated with the bacterial disesase Ulceroglandular Tularemia.I wasn’t able to find any common symptoms between this disease and 22q11.2 DS.
PODNL1 - protein-coding gene that is thought to function to inhibit smooth muscle cell proliferation and migration following arterial injury. This could be related to 22q11.2 deletion syndrome, as a major theme in the analyses is cell migration, epithelial cell proliferation and mesenchymal transition
SSC4D - protein-coding gene whose antibodes are directly related to the development of the immune system and the regulation of both innate and adaptive immune responses. This has a clear correlation to the results of the enrichment analyses, which had major themes of immune response and regulation of immune cell proliferation.
MAGED4B - gene whose expression is specific to the brain. This is most likely related to the pathways found in the downregulated gene set, which had a major theme of brain development and function.
Now, two visualizations will follow of the genes that are not annotated:
genesNoAnnotation <- setdiff(rownames(expression), genesSigEnrGs)
rankedGeneNoAnnotation <- ranks[which(ranks[,1] %in% genesNoAnnotation),]
There will be 2212 in this set. I filtered using a significance threshold of p < 0.05
genesTable <- DeSeqOutput %>% filter(pvalue < 0.05)
topHits <- genesTable[which(rownames(genesTable) %in%rankedGeneNoAnnotation$hgnc_symbol),]
topHits <- rownames(topHits)
heatmapMatrix <- expression[,2:20]
heatmapMatrixTophits <- t(
scale(t(heatmapMatrix[which(rownames(heatmapMatrix) %in% topHits),])))
metainfo <- readRDS(file=file.path(getwd(),"data","samples.rds"))
orderedMeta <- metainfo %>% arrange(group)
orderedHeatmapMatrixTophits <- heatmapMatrixTophits[, orderedMeta$Patient.number]
if (min(orderedHeatmapMatrixTophits) == 0) {
HeatmapCol = colorRamp2(c(0,max(orderedHeatmapMatrixTophits)),
c("white","red"))
} else {
HeatmapCol = colorRamp2(c(min(orderedHeatmapMatrixTophits), 0,
max(orderedHeatmapMatrixTophits)),
c("blue","white","red"))
}
AllHeatmap <- Heatmap(as.matrix(orderedHeatmapMatrixTophits),
cluster_rows = TRUE,
show_row_dend = TRUE,
cluster_columns = FALSE,
show_column_dend = FALSE,
col=HeatmapCol,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE)
AllHeatmap
Figure 6: Corresponding Heatmap
# first, create a heatmap matrix of all expression data
genesNoAnnotation <- setdiff(rownames(expression), genesAllGs)
rankedGeneNoAnnotation <- ranks[which(ranks[,1] %in% genesNoAnnotation),]
There will be 2212 in this set I then further filtered using a significance threshold of p < 0.05.
genesTable <- DeSeqOutput %>% filter(pvalue < 0.05)
topHits <- genesTable[which(rownames(genesTable) %in%rankedGeneNoAnnotation$hgnc_symbol),]
topHits <- rownames(topHits)
heatmapMatrixTophits <- t(
scale(t(heatmapMatrix[which(rownames(heatmapMatrix) %in% topHits),])))
metainfo <- readRDS(file=file.path(getwd(),"data","samples.rds"))
orderedMeta <- metainfo %>% arrange(group)
orderedHeatmapMatrixTophits <- heatmapMatrixTophits[, orderedMeta$Patient.number]
if (min(orderedHeatmapMatrixTophits) == 0) {
HeatmapCol = colorRamp2(c(0,max(orderedHeatmapMatrixTophits)),
c("white","red"))
} else {
HeatmapCol = colorRamp2(c(min(orderedHeatmapMatrixTophits), 0,
max(orderedHeatmapMatrixTophits)),
c("blue","white","red"))
}
TopHeatmap <- Heatmap(as.matrix(orderedHeatmapMatrixTophits),
cluster_rows = TRUE,
show_row_dend = TRUE,
cluster_columns = FALSE,
show_column_dend = FALSE,
col=HeatmapCol,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE)
TopHeatmap
Figure 7: Corresponding Heatmap
There doesn’t appear that there was any clear clustering present in both heatmaps, in either groups of patient and control samples. There was a significant decrease in number of genes after the significance threshold of p < 0.05 (going from 2212 to 45 genes in the last heatmap). Therefore, the majority of genes not annotated in any of the pathways are weak signals/not significantly expressed.
Lin, M., Pedrosa, E., Hrabovsky, A. et al. Integrative transcriptome network analysis of iPSC-derived neurons from schizophrenia and schizoaffective disorder patients with 22q11.2 deletion. BMC Syst Biol 10, 105 (2016). https://doi.org/10.1186/s12918-016-0366-0
Love, M.I., Huber, W., Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Genome Biology 15(12):550 (2014)
Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)
Mootha, Lindgren, et al. (2003, Nat Genet 34, 267-273).
Sim CB, Ziemann M, Kaspi A, Harikrishnan KN, Ooi J, Khurana I, Chang L, Hudson JE, El-Osta A, Porrello ER. Dynamic changes in the cardiac methylome during postnatal development. FASEB J. 2015 Apr;29(4):1329-43. doi: 10.1096/fj.14-264093. Epub 2014 Dec 9. PMID: 25491312.
Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and InterpretationMerico D, Isserlin R, Stueker O, Emili A, Bader GD PLoS One. 2010 Nov 15;5(11):e13984.
Pfeffer LM. The role of nuclear factor κB in the interferon response. J Interferon Cytokine Res. 2011;31(7):553-559. doi:10.1089/jir.2011.0028
Morsheimer M, Brown Whitehorn TF, Heimall J, Sullivan KE. The immune deficiency of chromosome 22q11.2 deletion syndrome. Am J Med Genet A. 2017 Sep;173(9):2366-2372. doi: 10.1002/ajmg.a.38319. Epub 2017 Jun 19. PMID: 28627729.
Stelzer G, Rosen R, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, Iny Stein T, Nudel R, Lieder I, Mazor Y, Kaplan S, Dahary D, Warshawsky D, Guan - Golan Y, Kohn A, Rappaport N, Safran M, and Lancet D. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analysis , Current Protocols in Bioinformatics(2016), 54:1.30.1 - 1.30.33.doi: 10.1002 / cpbi.5. [PDF]
Jiramongkolchai P, Kumar MS, Chinnadurai S, Wootten CT, Goudy SL. Prevalence of hearing loss in children with 22q11.2 deletion syndrome. Int J Pediatr Otorhinolaryngol. 2016 Aug;87:130-3. doi: 10.1016/j.ijporl.2016.06.005. Epub 2016 Jun 7. PMID: 27368459.
Reimand, J., Isserlin, R., Voisin, V. et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc 14, 482–517 (2019). https://doi.org/10.1038/s41596-018-0103-9